[Common] MOE Split dBias #2674
Conversation
Greptile Overview

Greptile Summary

This PR adds support for computing per-tensor dbias for grouped tensors, producing a grouped dbias output with one row per tensor.

The implementation correctly handles different shape representations (SAME_BOTH_DIMS, VARYING_FIRST_DIM, etc.) and maintains the restriction that dbias is only supported for tensors with a constant last dimension.

Issues to verify:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant API as nvte_group_quantize_dbias
participant Dispatch as group_quantize_bwd_helper
participant Kernel as group_quantize_mxfp8_kernel
participant Reduction as group_reduce_dbias_kernel
participant Output as GroupedTensor dbias
User->>API: Call with grouped input/output tensors
API->>Dispatch: Forward to dispatch layer
Dispatch->>Dispatch: Convert NVTEGroupedTensor to GroupedTensor
Dispatch->>Kernel: Launch quantization kernel
Note over Kernel: Compute partial dbias per 128x128 tile
Kernel->>Kernel: Write partial results to workspace
Kernel-->>Dispatch: Return partial dbias in workspace
Dispatch->>Reduction: Launch group_reduce_dbias_kernel
Note over Reduction: Reduce partial dbias per-tensor<br/>across chunk_dim_Y rows
Reduction->>Output: Write per-tensor dbias [num_tensors, cols]
Output-->>User: Return grouped dbias tensor
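For intuition, here is a minimal CPU sketch of the same two-stage flow (the function name, the flat row-major layout, and treating each tensor's gradient as a separate vector are illustration-only assumptions; the real kernels operate on 128x128 tiles in a device workspace):

#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the two-stage grouped dbias reduction described
// above: stage 1 accumulates one partial dbias row per chunk of rows in each tensor
// (what the quantize kernel writes to the workspace), stage 2 reduces those partial
// rows into a single dbias row per tensor (the role of group_reduce_dbias_kernel).
std::vector<float> group_dbias_reference(const std::vector<std::vector<float>>& grads,
                                         std::size_t cols, std::size_t chunk_dim_Y = 128) {
  std::vector<float> dbias;  // flattened [num_tensors, cols], like the grouped output
  for (const auto& grad : grads) {
    const std::size_t rows = grad.size() / cols;
    const std::size_t num_chunks = (rows + chunk_dim_Y - 1) / chunk_dim_Y;  // DIVUP
    // Stage 1: per-chunk partial sums over the row dimension.
    std::vector<float> partial(num_chunks * cols, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
      for (std::size_t c = 0; c < cols; ++c)
        partial[(r / chunk_dim_Y) * cols + c] += grad[r * cols + c];
    // Stage 2: reduce this tensor's partial rows into one dbias row.
    std::vector<float> out(cols, 0.0f);
    for (std::size_t chunk = 0; chunk < num_chunks; ++chunk)
      for (std::size_t c = 0; c < cols; ++c)
        out[c] += partial[chunk * cols + c];
    dbias.insert(dbias.end(), out.begin(), out.end());
  }
  return dbias;
}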
9 files reviewed, 1 comment
if (!is_single_tensor) {
  NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
             "Number of tensors in a group is larger than "
             "the MAX number of supported descriptors (64).");
  // Only full tiles supported
  NVTE_CHECK(last_logical_dim % CHUNK_DIM_X == 0,
             "Last dimension of a grouped tensor should be divisible by 128.");
  blocks = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
}
const dim3 grid(blocks);

NVTE_CHECK(elts_total % ELTS_PER_CHUNK == 0, "Only full-tile grouped tensors supported.");
const dim3 grid(elts_total / ELTS_PER_CHUNK);
const size_t block_size = THREADS_PER_CHUNK;
Regression: tail tiles disallowed
group_quantize now enforces elts_total % ELTS_PER_CHUNK == 0 and sets grid(elts_total / ELTS_PER_CHUNK) for all shapes, including the is_single_tensor case (shape_rep == SAME_BOTH_DIMS || VARYING_FIRST_DIM). In the previous implementation, the is_single_tensor path used DIVUP(first_logical_dim, 128)/DIVUP(last_logical_dim, 128), so tail tiles were accepted.
If callers previously passed non-128-multiple M/K for constant-last-dim groups, this change will now hard-fail. If dropping tail support is intentional, it should be documented and covered by tests; otherwise the grid/check logic likely needs to preserve the old DIVUP behavior for is_single_tensor.
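For concreteness, a small standalone illustration of the behavioral difference using the removed last_dim=160 case (the shape values are hypothetical; DIVUP stands for the usual round-up division):

#include <cassert>
#include <cstddef>

// Why a 160-wide single tensor passed the old DIVUP-based grid computation
// but fails the new full-tile check.
constexpr std::size_t DIVUP(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

int main() {
  constexpr std::size_t CHUNK = 128;
  constexpr std::size_t rows = 256, cols = 160;  // last dim not a multiple of 128
  const std::size_t old_grid = DIVUP(rows, CHUNK) * DIVUP(cols, CHUNK);  // 2 * 2 = 4 blocks
  const bool new_check = ((rows * cols) % (CHUNK * CHUNK) == 0);         // 40960 % 16384 != 0
  assert(old_grid == 4 && !new_check);  // old path launched tail tiles; new path hard-fails
  return 0;
}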
/te-ci
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
9 files reviewed, 3 comments
NVTE_CHECK(elts_total % ELTS_PER_CHUNK == 0, "Only full-tile grouped tensors supported.");
const dim3 grid(elts_total / ELTS_PER_CHUNK);
The new grid computation elts_total / ELTS_PER_CHUNK now requires all tensors (including is_single_tensor cases) to be exact multiples of 128x128 tiles. The old implementation used DIVUP for single tensors, allowing tail tiles. A test case with last_dim=160 was removed. Verify this breaking change is intentional and documented for users.
    ? (first_logical_dim / num_tensors)
    : first_dims_ptr[tensor_id];

const size_t rows = tensor_rows / chunk_dim_Y;
Verify that tensor_rows is always divisible by chunk_dim_Y (128), otherwise this division silently truncates and skips tail row reduction.
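A tiny numeric illustration of the truncation concern (the row count 200 is hypothetical):

#include <cassert>
#include <cstddef>

int main() {
  const std::size_t chunk_dim_Y = 128;
  const std::size_t tensor_rows = 200;                  // hypothetical, not a multiple of 128
  const std::size_t rows = tensor_rows / chunk_dim_Y;   // truncates to 1
  assert(rows == 1);  // the remaining 72 rows of partial dbias would never be reduced
  // A guard such as NVTE_CHECK(tensor_rows % chunk_dim_Y == 0, ...) in the launcher
  // would turn the silent truncation into an explicit error.
  return 0;
}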
const size_t dbias_in_offset_Y = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                                     ? (tensor_id * (tensor_rows / chunk_dim_Y))
                                     : (offsets_ptr[tensor_id] / cols / chunk_dim_Y);
Verify that offsets_ptr[tensor_id] / cols / chunk_dim_Y correctly computes the workspace offset for varying first dimensions.
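One way to sanity-check that formula is with concrete numbers, assuming offsets_ptr holds cumulative element offsets and all tensors in the group share the same cols (the values below are hypothetical):

#include <cassert>
#include <cstddef>

int main() {
  // Hypothetical group of three tensors with rows {256, 384, 128} and cols = 512.
  const std::size_t cols = 512, chunk_dim_Y = 128;
  const std::size_t offsets_ptr[] = {0, 256 * 512, (256 + 384) * 512};  // cumulative elements
  const std::size_t tensor_id = 2;  // starts after 256 + 384 = 640 rows, i.e. 5 chunk-rows
  const std::size_t dbias_in_offset_Y = offsets_ptr[tensor_id] / cols / chunk_dim_Y;
  assert(dbias_in_offset_Y == 5);  // consistent with (256 + 384) / 128
  return 0;
}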
if (global_dim_X % CHUNK_DIM_X != 0) {
  NVTE_DEVICE_ERROR(
      "The grouped tensor must be divisible by 128x128 tiles without a tail tile.");
}
Let's see the performance impact of having this here.
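If the in-kernel branch turns out to be measurable, one option (a sketch only, assuming global_dim_X is already available on the host at launch time, which may not hold for device-resident shapes) would be to hoist the divisibility check into the launcher:

// Sketch: validate once on the host before launch so the device path stays branch-free.
NVTE_CHECK(global_dim_X % CHUNK_DIM_X == 0,
           "The grouped tensor must be divisible by 128x128 tiles without a tail tile.");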
Description
This PR adds a new kernel that computes dbias separately for each tensor in a group and outputs a grouped dbias tensor containing per-tensor dbias values.

Fixes # (issue)
Type of change
Changes
Checklist: